

DIRECT TESTING OF SPEAKING PROFICIENCY: 
THEORY AND APPLICATION 


Proceedings of a Two-Day Conference Conducted 

by Educational Testing Service in Cooperation 

with the U.S. Interagency Language Round Table 

and the Georgetown University Round Table on 
Languages and Linguistics 


John L. D. Clark, ed. 




Educational Testing Service, Princeton, NJ




CONTENTS 


Page 
Preface v 
Development and Current Use of the
FSI Oral Interview Test   Howard E. Sollenberger   1
Interview Testing in Non-European
Languages   William Lovelace   15
Measuring Second Language Speaking
Ability in New Brunswick's Senior High
Schools   Murielle Albert   19
Using the FSI Interview as a Diagnostic
Evaluation Instrument   Stephen L. Graham   31
Direct Testing of Speaking Skills in a
Criterion-Referenced Mode   Robert B. Franco   41
Oral Proficiency Testing in New Jersey
Bilingual and English as a Second Language
Teacher Certification   Richard W. Brown   65
Adaptation of the FSI Interview Scale
for Secondary Schools and Colleges   Claus Reschke   75
Interview Techniques and Scoring Criteria
at the Higher Proficiency Levels   Randall L. Jones   89
Testing Speaking Proficiency through
Functional Dialogues   I. F. Roos-Wijgh   105
Scope and Limitations of Interview-Based
Language Testing: Are We Asking Too Much
of the Interview?   Robert Lado   113
Measuring Foreign Language Speaking
Proficiency: A Study of Agreement Among
Raters   Marianne L. Adams   129
Independent Rating in Oral Proficiency
Interviews   John Quiñones   151
Third Rating of FSI Interviews   Pardee Lowe, Jr.   159
Determining the Effect of Uncontrolled
Sources of Error in a Direct Test of Oral
Proficiency and the Capability of the
Procedure to Detect Improvement Following
Classroom Instruction   Karen A. Mullen   171


DEVELOPMENT AND CURRENT USE OF THE 


FSI ORAL INTERVIEW TEST 


Howard E. Sollenberger 


Director, Foreign Service Institute (retired) 




I address you today, not as a specialist in foreign language 
testing or as a linguist, but rather as an administrative philosopher and 
historian. Since I no longer administer, I can perhaps be permitted 
to give you some history of the development of the foreign language 
oral interview tests of the Foreign Service Institute (FSI) and to 
philosophize on the subject of this conference, "Direct Testing of
Speaking Proficiency: Theory and Application."


I hope I am not presumptuous in assuming that a brief historical case
study of the circumstances under which direct interview testing was first 
attempted on any significant scale, and how it developed into a system 
used throughout the federal government, would be helpful as background for 
our deliberations. Certainly we will want to examine both the advantages, 
and the implications, of putting theory into practice, in institution- 
alizing systems by which we attempt to measure and differentiate human 
performance. 


To paraphrase Dean Acheson, you might say that I was "present at the 
creation" or, perhaps more accurately, at the incubation of the oral 
interview testing system developed at the FSI. While it may now be rather 
dim in our memories, we were in a period of "cold war" intensification in
the early 1950s. It had wide and significant ramifications in our public 
life, and even in education. By the late 1950s it would, among other 
things, generate the National Defense Education Act, which was to support 
the upgrading of science, mathematics, and foreign area and language 
studies in American education. Meanwhile, with the impetus of the Korean
War and the experience of having been unprepared for the global war a 
decade earlier, the Civil Service Commission in 1952 was directed, under 
the National Mobilization and Manpower Act, to inventory and develop
a register of persons in government who had skills, background, and 
experience in various foreign areas and languages. 


Following normal bureaucratic procedures, the Civil Service Commis- 
sion created an interagency committee to study the problem and recommend 
procedures. At early meetings it became apparent that, if an inventory 
were to serve any useful purpose, some means of defining and differen- 
tiating levels of foreign language proficiency and area expertise would be 
necessary. The old labels of fair/good/fluent/bilingual were obviously
inadequate.


Dr. Henry Lee Smith (then dean of the FSI Language School), the State 
Department's representative of the interagency committee, pressed for a 
system and the development of criteria that would differentiate testable 
levels between "no knowledge" of a given foreign language and "total 
mastery." He was promptly named to head a subcommittee to prepare 
definitions and so-called working papers. As Dr. Smith's alternate on the 
committee, I became involved as a coconspirator in trying to get the 
federal government to realistically face personnel deficiencies in area 
expertise and foreign language skills. 




As it developed, there was not only difference of opinion, but also 
opposition to the concept. There was concern in certain agencies that 
through the proposed survey and the establishment of a national register, 
the Civil Service Commission would further interfere in the personal
fiefdoms of the various agencies. There was also fear that testing based
on new absolute standards would prove embarrassing to many employees who
had claimed "fluency" in a foreign language on their applications for
employment. To make a long story short, a compromise was reached that
provided for each agency to conduct its own survey using definitions and 
criteria established by the committee. Testing would be optional. 


There were five different factors considered in defining and differ- 
entiating levels of area expertise: systematic area training (A), basic 
social science training (S), professional experience in an area (PA),
professional experience related to an area (PE), and residence in an 
area (AR). Three to five differentiated levels were defined under each 
factor. 


Under the language proficiency section, symbolized by the letter L, 
six differentiated levels were defined. To avoid complicating the task,
no effort was made to separate the components of language proficiency, 
which were generally considered to be comprehension of oral production, 
speaking proficiency, reading proficiency and comprehension, and writing. 
At the base of the scale, L-1 was defined as "no proficiency in either
reading or speaking a foreign language." 


The upper end of the scale, L-6, was defined as "sufficient pro- 
ficiency in speaking, reading and writing to negotiate oral and written 
agreements and to thoroughly understand the press, popular and classical 
literature and official documents." It was noted that "this category is 
reserved for bilingual or native speakers of the language." 


It was proposed that category L-4 be considered as the minimum 
proficiency level for inventory purposes. This was defined as "sufficient 
proficiency in speaking a language to conduct ordinary routine business 
conversations and to read general non-technical material." It was noted 
that "this level of proficiency might normally be acquired by 9 to 12 
months of intensive language training or the equivalent in part-time 
study, depending on the difficulty of the language."


Bureaucratic foot-dragging, a change in the administration, and 
winding down of the Korean War resulted in the whole project being 
shelved. 


However, at the FSI, enough interest had been generated in the 
potential usefulness of this approach to stimulate further refinement of 
the scale and to experiment with structured oral interview testing of 
students. 


The second impetus came in 1955, when Loy Henderson, then Deputy 
Undersecretary of State, decided to conduct a survey of foreign language
skills in the Foreign Service. Up to that time there had never been an
inventory of language skills in the Foreign Service. Mr. Henderson was 
motivated by a conviction that post-war diplomacy would increasingly 
require face-to-face communication with people around the world as well 
as between government representatives and diplomats. In spite of some 
opposition within the Foreign Service, Mr. Henderson insisted that the 
survey be followed by testing. He also intended to tie promotions to 
tested foreign language proficiency. This was serious business in the
highly competitive Foreign Service. It was also serious business for the 
FSI and those who would design and conduct the tests.


Testing of the 1952 definitions of L-1 through L-6 on some 200 
officers showed them to be inadequate for the purpose of a self-appraisal 
survey of the Foreign Service. It became apparent that speaking and 
reading proficiencies would have to be separately determined. From this 
emerged the L and R scales, with the speaking (oral production) scale (L) 
graded from 1 to 6, and reading facility (R) differentiated from
1 to 5.


With this instrument a self-appraisal survey was conducted in the 
Foreign Service. It revealed that less than half of the 4,041 regular, 
reserve, and staff officers surveyed had a "useful to the service" 
proficiency in French, German, or Spanish. (These three languages, along
with English, were considered the "world languages" of diplomacy.)
"Useful" was then defined as "sufficient control of the structure of a 
language, and adequate vocabulary, to handle routine representation 
requirements and professional discussions within one or more special 
fields, and--with the exception of such languages as Chinese, Japanese, 
Arabic, etc.--the ability to read non-technical news or technical writing 
in a special field." This was the L-4, R-3 level as defined in the 
self-appraisal scales. 


These findings led to a new language policy, announced by the 
Secretary of State on November 2, 1956. This policy was based on the 
premise that foreign language skills are vital in the conduct of foreign 
affairs. Therefore, "each officer [would] be encouraged to acquire a
'useful' knowledge of two (2) foreign languages, as well as sufficient
command of the language of each post of assignment to be able to use 
greetings, ordinary social expressions and numbers; to ask simple 
questions and give simple directions; and to recognize proper names, 
street signs and office and shop designations." It further stated: 
"Evidence of achievement will be verified by tests administered by the 
Foreign Service Institute." 


Having been committed to testing, FSI was under pressure to develop 
reliable test procedures. As Claudia P. Wilds pointed out in her paper 
"The Oral Interview Test," published in 1975 by the Center for Applied 
Linguistics in Testing Language Proficiency: "Both the scope and the 
restrictions of the testing situation provided problems and requirements" 
previously unknown in language testing. 




In the course of developing and refining oral interview test procedures,
Professor John B. Carroll, then of Harvard, was consulted. This
led to a revision of the differentiated levels of proficiency and the
redesignation of the symbols and levels. The symbol L was changed to S
for the speaking scale, and R was retained for the reading scale. Each
scale was differentiated into six levels, numbered from 0 to 5.


Since this provided, for the first time, officially approved perform- 
ance and criterion-based definitions that testers, instructors, and 
administrators found useful, the system rapidly became institutionalized 
and the S and R symbols became part of the jargon. 


Not surprisingly, problems began to emerge. Officers being tested
complained that different testing teams applied different standards, 
particularly in testing different languages. For example, it was commonly 
believed--and with some justification--that an S-3 rating was much
tougher to get in French than in the so-called hard or esoteric languages. 
It was also rumored that students tested by their own instructors seemed 
to fare better than those who simply came in for tests. Testers seemed 
to be more critical in judging the performance of those whom they did not
know through a teacher-student relationship. In some cases, the rank and
age of the officers were seen to influence the rating. Informally there 
developed what became known as the "compassionate" S-3 rating. There was 
also evidence that some testers seemed to be unduly influenced by the 
personalities and cooperativeness of persons being tested. 


With mandatory testing of Foreign Service officers announced in 1957, 
and with assignments and promotions to be influenced by the results, these 
problems had to be solved. An independent testing unit was established in 
July 1958, with Frank A. Rice as head of the unit and Claudia Wilds as his 
assistant. It was through the collaboration of these two people that a 
significant breakthrough came in standardizing oral testing procedures. A 
checklist was developed that contained five "factors": accent, grammar, 
vocabulary, fluency, and comprehension. Considerable work went into 
selecting these factors. The criterion was that they should be of a 
sufficiently general nature that they would apply equally well to all 
languages. Each factor was subdivided as a six-point descriptive scale, 
with "polar" terms X (extremely poor or inadequate) and Y (extremely good, 
accurate, or complete). 


As Frank Rice pointed out in an article entitled "The Foreign 
Service Institute Tests Language Proficiencies" (Linguistic Reporter, May
1959): "The original purpose of the Check List was to help counterbalance 
the inherent subjectivity of the testing procedure by providing agreement 
about what aspects of the performance were to be observed, a control on 
the attention of the observers, and a system of notation that would make 
judgments of different observers more nearly comparable. 

"There is no doubt that the Check List accomplished its original 
purpose. This was expected. What was quite unexpected was what emerged
from statistical analysis. This provided basic evidence of a high degree
of consistency in the subjective judgments of the examiners. The instru- 
ment could thus serve not only as a useful record, but also as a highly 
accurate predictor." 


It also provided a means for training testers. Claudia Wilds, who 
was appointed head of the testing unit in 1963, subsequently developed a 
weighted scoring system for the checklist. Among other things, this 
provided a means for occasional verifications of the checklist profiles 
and seemed to keep examiners in all languages reasonably in line with each
other.


Further evidence of the success of this system was the sharp drop-off 
of complaints from persons being tested, and general acceptance of the 
results even for critical personnel decisions. Also, use of the rating 
scale and test results began to spread. With some modifications, the CIA 
developed a similar system, and the United States Information Agency and 
the Agency for International Development joined with the Department of 
State in using the FSI-developed standards and testing facilities. 


Even the Congress used them, demanding reports based on FSI standards 
to show progress toward compliance with a legislative mandate that the 
Department of State "designate every Foreign Service officer position in a 
foreign country whose incumbent should have a useful knowledge of a 
language or dialect common to such country [and that] each position so 
designated... be filled only by an incumbent having such knowledge" (Sec. 
578 Foreign Service Act of 1946). 


With the spreading use, in the 1960s, of the proficiency rating 
scale to other agencies, including the Defense Language Institute and the 
Peace Corps, it became apparent that the definitions should be further 
revised and standardized among agencies. Representatives of the FSI, the 
CIA, the Defense Language Institute, and the Civil Service Commission 
met in 1968 and developed a unified version of the definitions. These 
definitions are essentially the ones used today, and are shown as Appendix 
A of this paper. 


Now, twenty-five years after the inception of a criterion-referenced 
rating scale, it has been incorporated into the federal personnel manual 
for use throughout the U.S. government, and it has been adopted by the 
Supreme Headquarters of the Allied Powers in Europe. Educational Testing 
Service has joined the ranks of users, and increasing interest has 
been shown in academic circles--an interest that promises impact and
contributions in the future.


At the beginning of this paper, I stated my hope that we would 
examine the limitations and implications of applying theory to practice in 
the direct testing of speaking proficiency. As I have observed this in 
the government, it has become apparent to me that one of the principal 
limitations is the inability of this system to make meaningful judgments 
or to measure the most significant objective of human speech--effective
communication. By this I mean the effectiveness or lack thereof of an
individual in listening to and fully understanding what he hears through 
the static of cultural differences and the peculiarity of personality, and
the ability to communicate fully with another person of another
culture in such a way as to achieve understanding and cooperation. 


I have observed more than a few cases where I cringed at the thought 
that an individual would represent the United States overseas, even 
though he had been given a high S-4, R-4 language proficiency rating
by our tests. The person's so-called language proficiency, while it 
may have been quite accurate in terms of technical skill, did not mean 
effectiveness in communication. In some cases, it may have enabled the 
person to misrepresent or foul up more effectively. This is to say that 
you can be a fool in any language or that you can put your foot in your 
mouth in any language. Nor does the fact of technical ability to use 
a foreign language without noticeable accent or grammatical errors mean 
that the person has something worth saying. I'm sure we all know people 
who talk nonsense fluently. 


On the other hand, I know people who butcher the language, whose 
accents are atrocious, and whose vocabularies are limited. For these 
reasons we give them low proficiency ratings. Yet, for some reason, 
some of them are effective communicators. 


You may rightly say that the tests we have developed do not measure 
this dimension of effective communication. Still, I know a number 
of administrators and even some linguists who do not understand the 
implication of this difference. 


I have also observed, in the application of these testing procedures 
in training situations, a tendency to train for success on the test score, 
or to the standards of the test, rather than for broad effectiveness in 
communication. It becomes more important to the teacher and the student 
that they achieve the S-3 level, rather than that they be effective 
communicators. These are not necessarily mutually exclusive objectives, 
but there are times when this is forgotten. 


I am not saying that these limitations, which deal with the use of 
measurement devices we create, should cause us to abandon our efforts to 
perfect and use such systems. It is, however, my conviction that these 
and other limitations must be recognized and that we have a continuing 
obligation to make these limitations known to end users. In this we
are no different from the scientist who makes a discovery that can, if
properly used, be of benefit to humankind but that can also be misused.
I hope this conference will not ignore these responsibilities.


SCOPE AND LIMITATIONS OF INTERVIEW-BASED LANGUAGE TESTING: 


ARE WE ASKING TOO MUCH OF THE INTERVIEW? 


Robert Lado 


Georgetown University 




Introduction 


The Physician's Interview and Examination 


What happens when one goes to the doctor for a serious examination? 
The doctor begins by interviewing the patient: "How do you feel? What 
seems to be the problem? How long has that been bothering you? Have 
you had those symptoms before? When does it hurt? How is your appetite? 
Are you able to sleep at night? What is your normal weight? Have you
been losing weight lately?" And so forth. 


Your attitude is one of cooperation with the physician; that is, you
do not try to mislead the doctor or hide your symptoms. Yet, as a rule,
the doctor does not make a serious, final diagnosis directly from the 
interview and first-hand observation of your appearance and behavior. 
Questions are raised in the doctor's mind. Mental notes are made as the 
interview proceeds. Hypotheses develop and are often discarded to make
way for other possibilities.


Depending on the observations made during the interview, the doctor 
proceeds with a number of specific tests. The doctor or a trained nurse 
takes your exact weight instead of accepting your report or making an 
estimate from your height and the look of your waistline. Stunt men at 
carnivals can make remarkably accurate guesses of your weight by simply 
looking at you, and they bet they can guess within five pounds of it or 
you win a prize. Yet your physician asks you to step on the scale and 
measures your weight to within a pound or less. The carnival estimator 
bets he can come within a ten-pound range, and he does not always win. 
The physician would not even consider recording a sharp-eye estimate. 


In addition, the doctor may listen to your heart, check your pulse, 
or listen through a stethoscope as he taps your chest. He or she does 
not just hold a tight grip on your arm to estimate your blood pressure as 
circulation begins to pulse through. A sphygmomanometer measures that 
pressure so a reading can be made from the height of a mercury column
against a scale or from a needle pointing to a circular scale. And
notice that to take your pulse rate the physician or the nurse looks at a
watch as a count of the pulsations is made. It is easy to train yourself
to count seconds quite accurately, yet physicians prefer to look at
watches. 


The doctor may take a chest X-ray and examine it, make an electrocardiogram,
tap your knee for reflexes, and look at your throat, ears, and
nose. If there is a hearing problem, the doctor does not just whisper to
see if you hear; he or she asks for an audiology test, which measures
responses at different sound frequencies. 




The physician may take one or more blood samples, or collect a urine
specimen, which will be sent to a laboratory and tested for sugar,
infection, albumen, or whatever.


Only after the doctor has collected the results of the various 
specific tests and interpreted them together with the interview does he or 
she attempt to reach a final diagnosis and prescribe treatment. If the 
results are inconclusive or contradictory, additional tests are ordered. 
Where would modern medicine be if doctors depended exclusively on the 
interview and direct observation of patients? 


The Oral Interview Test (OIT) 


In the OIT, a trained linguist (the counterpart of the physician)
elicits samples of speech by asking questions, suggesting topics, and 
probing into the usage of the examinee. The examinee, often in an 
antagonist role, tries to exhibit the best usage and avoid pitfalls that 
might lower the rating. The trained examiner keeps on probing until 
satisfied that the true level of performance has been established or until 
time becomes a problem. 


Unlike the medical examination, the OIT does not lead to additional
tests to obtain a more complete picture of competence in specific areas.
Instead, the examiner searches for questions and topics that might elicit
desired responses and exhibit weaknesses and competencies.


It may be argued that linguistic competence is less complex than the
functioning of the human body, yet linguistic competence is one of the 
most complex achievements of a human being. In research on linguistic 
geography, it takes interviewers many hours of exploration with the aid of 
questionnaires to report the speech characteristics of a single informant. 


By contrast, in an OIT, which lasts from five to thirty minutes, the
examiner immediately reaches the diagnosis or rating that says what the 
examinee can and cannot do in and with the language, or, to use the Civil 
Service ratings, that the examinee is native-bilingual, full professional, 
minimum professional, limited working, or elementary in speaking. 


From this observation of the examinee in conversation, the examiner 
decides finally and irrevocably if the examinee can perform full profes- 
sional functions through the language. And, because of the fact that 
there is a face-to-face conversation, the examination is considered a 
valid replication of professional function, which it is not. 


When the physician suspects there might be a problem related to the 
weight of a patient, he or she reaches for the exact weight measurement 
and does not trust an approximate estimate. The oral interview examiner, 
however, trusts the approximate estimate. When the physician suspects a 
hearing problem, he or she does not stop with direct observation, but 
studies the audiogram showing thresholds at various frequencies on the
sound spectrum. And the physician puts more trust in the audiogram, which
separates the elements of sound into frequencies, than in the integrative 
informal test of speaking to determine if the patient hears normally or 
not. Yet, in the OIT, the examiner does not use any specific measures 
beyond direct observation of the behavior of the examinee because it is 
supposedly more valid to do so than to seek more precise information 
by means of additional tests of various elements. Where can language 
examinations go if we insist on exclusive reliance on direct impression 
examinations for our final diagnoses?


Evaluation 


So far we have argued only by analogy, and analogy does not prove 
anything. But we would have to be blindfolded not to recognize that the 
analogy raises some interesting questions about the possible limitations 
of the technique. It seems to me that we are justified in assessing in 
a more formal way the fundamental strengths and weaknesses of the OIT. In
testing terms, this means inquiring formally into the validity, 
reliability, scorability, representativeness, and practicality of the 
test, and determining what it does and does not do well and how it 
can be modified or combined with other techniques to produce better 
results. 


Validity 

Validity is the most important single criterion to evaluate a test. 
It is critical because without validity all other criteria, including 
reliability, are worthless. Validity simply asks whether and to what 
extent a test measures what it claims to measure. There is no absolute 
and final answer to the question of validity, since a test only samples
what it purports to test. Instead, we search for evidence that supports
or weakens its claim, and then, on the basis of all the evidence, we make
a judgment.


There are many ways we can seek evidence to answer the validity 
question. Some of the most convincing evidence comes from (1) face
validity, (2) content-of-sample validity, (3) native speaker performance, 
and (4) empirical or statistical validity. 


FACE VALIDITY. The greatest strength of the OIT is its surface or
face validity, i.e., the appearance on simple inspection that it tests 
speaking, which is what it claims to test. The OIT has all the appearance 
of testing speaking ability: it is actually a speaking performance on the 
part of the examinee and a speaking performance is not a substitute for 
speaking but speaking itself.


If we were to rely on face validity alone, we would give the OIT the
highest validity rating as a speaking test. Such a rating would be amply
justified if speaking a language were as simple as riding a bicycle or
driving an automobile. By analogy, the OIT would be equivalent to the
road test of a driver's examination. 


But mastering a language is more complex than driving a car, and on
the basis of the questions raised by our analogy with the physician's
examination, we should go beyond face validity into a deeper evaluation of
the OIT. Even in a driver's examination, it is common practice to take a
written test prior to the road test. And the road test itself is not
merely driving around the nearest block but a series of tasks that probe
the competence of the driver in various maneuvers.


With regard to the OIT, we notice immediately that it is a restricted
sample of speaking that, as such, may or may not give a fully accurate 
picture of linguistic or communicative competence. This leads us to 
content-of-sample validity. 


CONTENT-OF-SAMPLE. Content in a language test refers to the language 
and the situations tested. We know that language is a system of rules,
patterns, and lexical items and their meanings used by a speech community 
to communicate and interact in carrying on the multiple functions typical 
of life in that community. We should, therefore, inquire into the content 
of the OIT with regard to grammatical system, vocabulary, pronunciation, 
situations, and fluency. 


Grammatical System. In the OIT the examinee may not have sufficient
opportunity to ask questions, for example, or to use requests, invi- 
tations, or exclamations, or use various types of complex sentences 
or passive or reflexive constructions. The experienced examiner guards 
against such lacunae but may not be able to elicit utterances containing 
important elements of competence such as the different types of questions, 
including those of the yes/no, information, subject, verb phrase, 
predicate, and echo types, among others. 


We all agree that the total language system cannot be tested in one 
interview and that we must, therefore, be satisfied with a sample. But 
how is that sample to be chosen? By subjective impressions? By error
counts? By linguistic analysis? Without precise criteria concerning the 
sample, there is bound to be variation among interviewers and from one 
interview to another with regard to the elements elicited. An informal 
general list such as examiners often have in mind allows too much 
variation. 


In a recorded OIT of Spanish, which lasted twenty minutes and yielded
an S rating of 4, the examiners asked fifty-five questions and the
examinee none (DeCesaris, 1977). The examiners made a clear effort to 
elicit the subjunctive and conditional forms, but they overlooked the area 
of interrogatives completely. 


Years ago, I was called as a consultant to evaluate an OIT under 
development for the Air Force to test illiterate Puerto Rican recruits in
spoken English. It was a carefully structured interview that sought to
test competence in a number of areas. On examining it, I discovered that
it did not provide for questions to be put by the examinee. 


Vocabulary. We know that even full bilinguals do not have completely 
parallel competence in all lexical areas of the two languages. I, for
example, feel less competent to discuss psychology in Spanish than in 
English, because practically all my study of psychology was in English, 
but I feel more competent to discuss literature in Spanish. Should the 
topic be soccer, I would again do better in Spanish; if it were current
movies, I would do badly in both. Yet, on the basis of a conversation on
some informally chosen topic, the OIT may report a rating of S-4, full
professional proficiency, which is described as "able to use the language 
fluently and accurately on all levels normally pertinent to professional 
needs," without necessarily sampling the lexical areas in which Full 
professional competence has been achieved. 


Pronunciation. The OIT provides a highly valid sample of an 
examinee's competence in pronunciation, with respect to both face validity 
and content-of-sample validity. Practically all the phonemes and phoneme 
sequences of the language and most of the intonation and rhythm patterns 
will be exhibited. There are problems with regard to scoring, but not 
with validity. 


Situational Content. One of the strengths of the OIT is that it 
represents performance in a communicative situation. This is more valid 
than reciting memorized texts as a measure of speaking, and it is more 
valid than a repetition test described by Politzer et al. (1974). It is 
more valid than the noise test, which is essentially a dictation with 
noise interference, as reported in Spolsky et al. (1968) and Gaies et al. 
(1977). 


By attempting to introduce different questions and tasks, the 
examiner tries to improve the situational content. In this sense the 
OIT can be more effective than a picture stimulus test if the examiners 
are experienced. Nevertheless, the OIT is not fully representative for 
two reasons. (1) The OIT is a test of conversational competence rather 
than of extended formal speaking. It does not sample the ability of a 
professor to deliver a lecture to a class, or of an ambassador to give a 
public lecture, as ambassadors are often invited to do. (2) It does 
not sample sociolinguistic variations, which are sometimes critical in 
effective communication. Notice, for example, variations required 
in addressing men and women, older and younger persons, individuals of 
high status, and in-house employees of different sociolinguistic status. 
Of course, these differences could be deliberately sought out in the 
interview and become part of it. The question would be then whether 
the OIT were too long. Would its spontaneity be hampered? Could these 
variations be tested by other means? 


Fluency. Fluency is sampled quite adequately in the OIT. As with 
pronunciation, any problems with regard to fluency will be in scoring 
rather than in validity. Are all examiners rating the same thing when 
they rate fluency? Should it be more explicitly defined? 




NATIVE SPEAKER PERFORMANCE. The OIT seems strong with regard to 
native speaker performance. All examinees would presumably perform at 
a rating level of 5 if tested in their native language. Yet there is one 
[...] 
personality, and presence. Would all examinees give a typical performance 
each time if tested in their native language? We are intuitively 
aware that we do not always perform at our best under all circumstances. 
Is there any substance to this impression? 


Differences in performance among educated adults may not turn out to 
be of major importance, but differences among children are substantial, as 
reported by sociolinguistic studies of ghetto children. I recently made a 
sound movie of a two-year-old Spanish-speaking child learning to read 
Spanish. The parents had reported that he was able to read three books of 
an experimental series. Yet, when we attempted to film his performance, 
he did not read a single word, even though the filming was at home with 
his parents. The OIT is not a test for two-year-olds, of course, but it 
would be interesting to test some adult examinees in their native language 
to see what performance they actually display. 


EMPIRICAL VALIDITY. A standard empirical validation of a test is its 
correlation with a valid criterion. The valid criterion could be the 
scores on a speaking test whose validity has been previously established. 
With the OIT we cannot use this approach because we simply do not have a 
fully validated and established speaking test. 


To obtain a more valid criterion, we will have to turn to (1) a more 
extended version of the OIT with adequate sampling of situations and 
language, (2) an increase in the number of graders or an increase in their 
competence, or (3) a combination of the above. If it turns out that the 
OIT correlates highly with the longer and better-structured version scored 
by a group of qualified examiners, we would be justified in considering 
the OIT validated. 


I have not seen such a validation attempt. Instead, I have seen 
a proposal that a shorter version be correlated with the full OIT to 
validate the shorter version. Obviously, if the shorter version 
correlated highly with the normal length OIT, we would gain by the 
practical advantage of its shortness. 


However, since we are still exploring possible limitations of the 
OIT, its validation with a longer, structured OIT scored by more than two 
judges would seem to be of greater interest. Another possibility is the 
use of in-depth interviews supplemented by additional tests. 


Reliability 


Reliability has to do with the stability of obtained scores. If 
scores fluctuate excessively for the same students on repeated 
administrations, the test is unreliable. The extent to which scores are reliable 
is expressed as a correlation between two sets of scores made by the same 
students on the same test. In reliability, then, the test is correlated 
not with a separate criterion, as in empirical validity, but with itself. 
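
As a minimal sketch of what "correlated with itself" means in practice, the following computes a test-retest coefficient as a Pearson product-moment correlation between two administrations. The scores are hypothetical, and coding a "+" rating as an added 0.5 is an assumption made only for illustration; it is not a convention taken from the OIT literature.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings for the same six students on two administrations
# of the same test ("+" coded as +0.5, an assumption for this sketch).
first_admin = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
second_admin = [1.0, 2.0, 2.0, 2.5, 3.5, 4.0]

reliability = pearson_r(first_admin, second_admin)
print(round(reliability, 2))
```

A coefficient near 1 indicates that students keep nearly the same relative standing from one administration to the next; it says nothing about whether the test measures the right thing, which is the separate question of validity.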


The fewer the possible grades on the scale of a test, the easier it 
is to attain high reliability. The extreme case is a pass-fail test with 
a single cutoff point between passing and failing. Most students will be 
either far above or far below the cutoff point and thus assure high 
reliability since only those that are close to the cutoff point are likely 
to fluctuate. 


The OIT rating scale is based on nine effective slots, 0+, 1 and 1+, 
2 and 2+, 3 and 3+, and 4 and 4+. It is not difficult to attain high 
reliability with such a scale. If scores were distributed over fifty or a 
hundred points on the scale, we would expect the reliability of the OIT to 
be lower. 
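
The intuition can be illustrated with a toy example that is not drawn from any OIT data. The sketch below takes hypothetical paired scores on a 0-100 scale, bins them into scales of two, nine, and fifty slots, and reports how often a student lands in the same slot on both administrations. Exact agreement is used here as a simple stand-in for stability; it is not the correlation coefficient itself.

```python
def slot(score, n_slots):
    """Map an integer score on a 0-100 scale to one of n_slots equal-width slots."""
    return min(n_slots - 1, score * n_slots // 100)

def exact_agreement(first, second, n_slots):
    """Fraction of students placed in the same slot on both administrations."""
    same = sum(
        1 for a, b in zip(first, second) if slot(a, n_slots) == slot(b, n_slots)
    )
    return same / len(first)

# Hypothetical scores: each student fluctuates by a few points between sittings.
first_admin = [12, 27, 38, 45, 51, 64, 73, 88]
second_admin = [15, 24, 41, 47, 49, 66, 70, 91]

for n_slots in (2, 9, 50):
    print(n_slots, exact_agreement(first_admin, second_admin, n_slots))
```

With these invented scores, the small fluctuations move only the students who sit next to a slot boundary on the two-slot and nine-slot scales, while on the fifty-slot scale every student changes slot.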


The nine-point scale is apparently satisfactory for present government 
users of the test. For academic purposes, however, it is too coarse 
and tends to bunch up scores around the 1 and 1+ ratings, masking progress 
within and between them. The nine-point scale is a weakness also for 
control-type research because it tends to flatten out significant 
differences in achievement in the range where most scores fall. 


Wilds (1975), while staunchly affirming, "The fact of the matter is 
that this system works," admits that 


Even in languages in which tests are conducted frequently as 
French and Spanish, where there is no doubt that standards are 
internalized and elicitation techniques are mastered, it is 
possible for criteria to be tightened or relaxed unwittingly 
over a period of several years so that ratings in the two 
languages are not equivalent or that current ratings are 
discrepant from those of earlier years. 


and 


It is, however, very much an in-house system which depends 
heavily on having all interviewers under one roof, able to 
consult with each other and share training advances in 
techniques or solutions to problems of testing as they are 
developed and subject to periodic monitoring. It is most apt to 
break down as a system when examiners are isolated by spending 
long periods away from home base (say a two-year overseas 
assignment), by testing in a language no one else knows, or by 
testing so infrequently or so independently that they evolve 
their own system. (p.35) 




The fact that two examiners are required to rate the OIT indicates 
lack of confidence in the rating by one examiner. This compares 
unfavorably with standard practice in testing, which as a rule relies on 
one scorer. Because of weaknesses in reliability, the practice of using 
two examiners should be maintained if practical from the point of view of 
trained personnel and cost. Dyson (1972) found that a shorter examination 
with team marking was better than a longer test with a single marker. 


Scorability 


The subjective nature of the OIT scoring is one of its weaknesses in 
its present form and use. According to Clark (1975), it takes four full 
days to train an examiner. And Wilds (1975) indicates, as quoted above, 
that examiners who are out in the field for two years must be retrained. 
The CIA has its two examiners rate the interview separately, and averages 
the ratings on a scale. The FSI has the interviewers discuss their 
differences to arrive at an agreement. These are indications that scoring 
the OIT is difficult and subjective to a significant degree. Improvement 
in this area is obviously desirable. 


A standard way to improve objectivity in scoring is to identify the 
measurable parameters of competence. The rating scales for accent, 
grammar, vocabulary, fluency, and comprehension reported by Wilds (1975) 
represent an effort in this direction. One may be puzzled, however, by 
the weights of the different components: three points to grammar, two to 
vocabulary, one to fluency, two to comprehension, and zero to accent. 
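
The consequence of a zero weight can be made concrete with a short sketch. The weights are those given above; the sub-score scale, the example values, and the use of a weighted mean as the composite are assumptions for illustration only, since the conversion actually used to reach a final rating is not described here.

```python
# Component weights as reported in the text (attributed to Wilds, 1975).
WEIGHTS = {
    "grammar": 3,
    "vocabulary": 2,
    "fluency": 1,
    "comprehension": 2,
    "accent": 0,
}

def weighted_composite(subscores):
    """Weighted mean of component sub-scores (hypothetical composite rule)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS) / total_weight

# Two hypothetical examinees who differ only in accent.
strong_accent = {"grammar": 4, "vocabulary": 3, "fluency": 5, "comprehension": 4, "accent": 1}
nativelike = dict(strong_accent, accent=6)

print(weighted_composite(strong_accent), weighted_composite(nativelike))
```

Under any rule of this kind, two examinees who differ only in pronunciation receive the same composite, which is the puzzle raised by the zero weight.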


This cannot mean that pronunciation is not an important factor in 
speaking. Pronunciation contributes to intelligibility even though 
redundancy resolves many inaccuracies in pronunciation. Furthermore, 
sociolinguistic studies show that foreign language accentedness and social 
dialect markedness are perceived and judged by native speakers very 
quickly. A speaking test must, therefore, be considered incomplete until 
pronunciation is taken into account, either on a complex scale showing 
foreign and social dialect dimensions or on an inventory of pronunciation 
features or phonemes and sequences. And if this makes the OIT too 
difficult to score by available examiners, it should be supplemented with 
a pronunciation test of some kind to give us a better picture of speaking 
skill. 


Practicality 


Practicality must be considered in conjunction with the particular 
uses intended for the OIT. The FSI, CIA, Peace Corps, and other agencies 
and organizations that have the trained personnel on hand and can keep 
careful control of ratings find the OIT practical. The estimated cost of 
$35 per examination (Jones, 1975, p.9) and the fifteen interviews that can 
be administered by a team of two examiners in a working day (Clark, 1975, 
p.20) are also acceptable to those users. A twenty-minute interview by 
two trained examiners limits the use of the OIT in university and high 
school settings for practical reasons. It would take a team of examiners 
a full working week and two additional days to test 100 students, a not 
uncommon task in those settings. 


If the OIT were shortened to, say, five minutes, its practicality 
would be significantly enhanced. If, in addition, a single examiner were 
used, subject to checking by a second examiner when challenged, a further 
improvement in practicality would be effected. 


The OIT as a Listening Comprehension Test 


The OIT shows obvious weaknesses as a listening comprehension 
instrument. In the interview that I analyzed from a recording, the 
examiners asked fifty-five questions and the examinee required 
clarification only once. In speaking, however, the examinee did not 
ask any questions. The speaking sample was exclusively expository and 
narrative. In listening comprehension it was all questions and no 
narration or exposition. This represents a weakness in content-of-sample 
validity. Furthermore, it is doubtful that any careful check could have 
been kept on comprehension, since attention was on speaking. 


Kaufman (1969) compared the S-ratings of forty-four Peace Corps 
volunteers on the OIT with their listening comprehension scores on the 
Pictorial Auditory Comprehension Test (PACT) developed for the Peace Corps 
by John B. Carroll. PACT is a seventy-five item multiple-choice test 
that uses four pictures as alternatives for each item. The tests were 
administered after a nine-week intensive course in Spanish conducted in 
Puerto Rico. The interviews were administered by Kaufman shortly after he 
was recertified by the Foreign Service Institute to administer the OIT in 
Spanish to Peace Corps volunteers. Kaufman was assisted throughout the 
oral testing by a Puerto Rican and a Colombian, who had not been involved 
in the training of these volunteers. 


The S-ratings on the OIT and the listening comprehension (LC) scores 
on PACT are presented in Table 1. The correlation between the two sets of 
scores, using the Pearson product-moment linear correlation formula, 
was .83. This is fairly high and could be used to compare performances by 
groups of similar students. Looking into a comparison of performance by 
individuals, however, a different picture emerges. 


Dividing the PACT scale into nine intervals to parallel the nine OIT 
ratings, and equating the two scales at their modes (the slots with the 
largest number of scores in each scale), we note that 68 percent of the 
students who rated within the five levels 0, 1, 2, 3, and 4 (without 
separating the 0+, 1+, etc.) also rated within the corresponding double 
intervals on the PACT scores, while 32 percent were either above or 
below. Using the full nine-point scale on both the OIT and PACT, 36 
percent of the students remained in the same slot and 64 percent were 
either above or below. 
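
The two percentages above are agreement rates computed at two levels of granularity. A minimal sketch of that computation follows; the five pairs of ratings are hypothetical and are not Kaufman's data, and the PACT scores are assumed to have already been converted to rating slots.

```python
def collapse(rating):
    """Merge a plus rating into its base level, e.g. '2+' becomes '2'."""
    return rating.rstrip("+")

def agreement_rate(ratings_a, ratings_b, collapsed=False):
    """Fraction of students given the same slot by both measures."""
    if collapsed:
        ratings_a = [collapse(r) for r in ratings_a]
        ratings_b = [collapse(r) for r in ratings_b]
    same = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return same / len(ratings_a)

# Hypothetical ratings for five students (not taken from Table 1).
oit_ratings = ["1", "1+", "2", "2+", "3"]
pact_slots = ["1", "1", "2+", "4+", "3"]

nine_slot = agreement_rate(oit_ratings, pact_slots)
five_level = agreement_rate(oit_ratings, pact_slots, collapsed=True)
print(nine_slot, five_level)
```

As in the study, agreement on the collapsed five-level scale is necessarily at least as high as on the full nine-slot scale, since every exact match survives the merging of plus ratings.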


TABLE 1 


Spanish OIT S-Ratings and PACT LC Scores of 44 Peace Corps Volunteers 


[Chart not reproduced. Vertical axis: PACT LC scores, with the equivalent rating slots in parentheses. Horizontal axis: OIT S-ratings.] 


*Indicates what the LC rating would have been if measured by PACT. 




In other words, if we use the OIT speaking ratings to predict PACT 
listening comprehension performance using a nine-slot rating scale, we 
are off by at least one level in approximately two-thirds of the cases, 
indicating that the OIT S-ratings are not satisfactory measures of 
listening comprehension. The reverse would also be true; that is, if we 
use PACT listening comprehension scores to predict speaking performance in 
terms of OIT ratings, we are off by at least one level in approximately 
two-thirds of the cases, indicating that PACT listening comprehension 
scores are not valid measures of speaking performance. This is further 
confirmed by looking at some specific cases. We notice, for example, that 
one student rated 2+ by the OIT would be rated 4+ by PACT. Another 
student, with OIT 2, would rate PACT 4. And a third student, with OIT 1, 
would rate PACT 2+. 


Consequently, since a listening comprehension test can be administered 
with ease to individuals as well as groups by examiners with 
standard training, and since results are scored objectively and quickly, 
separate listening comprehension tests are to be preferred in all cases in 
which examinees are willing to submit to them. 


What the OIT Does and Does Not Do Well and What to Do about It 


Selecting and condensing some of the above considerations, it is not 
unreasonable to state the following conclusions and recommendations. 


1. The OIT is the best available test to obtain a valid speaking sample. 
It should, therefore, be retained when the necessary requirements with 
regard to personnel training and availability and budget provisions 
are present. 


2. The representativeness of the speaking sample is less satisfactory 
than that of professionally prepared tests of listening comprehension, 
reading, and writing. Therefore, the OIT should be further 
structured to ensure better sampling of linguistic, situational, 
and sociolinguistic components, or it should be supplemented by 
other tests that are more effective in those areas. The OIT could 
then be shortened to a more practical and uniform length. 


3. Scoring of the OIT is unusually difficult and must be presumed 
uneven under ordinary testing conditions. This problem can be 
minimized by not relying exclusively on the OIT but supplementing 
it instead with other objective tests. 


4. The OIT is not a good test of listening comprehension by psychometric 
standards. It should, therefore, not be used as a measure of that 
skill. Listening comprehension tests are far superior and can be 
administered individually as well as in groups at a fraction of the 
cost of the OIT and with lower demands on personnel training. 




5. The OIT is not a practical test of competence or internalization 
of grammar, vocabulary, and pronunciation, because of sampling 
and scoring problems. Therefore, it should be supplemented whenever 
possible with tests of those components when necessary. 


6. The OIT is not a test of reading or writing and should not be used 
as a measure of those skills. This is stated to counter any claim 
that language competence is general in nature and need not be tested 
in its different manifestations. 


7. Since the OIT is difficult to administer and score, and because it 
requires highly trained personnel not always available, it should 
be restricted to VIPs who might not be willing to submit to other 
types of tests. For wider use, a short version of the OIT with 
more limited goals, supplemented by additional tests, is 
recommended. 


Conclusion 


To the query whether we are asking too much of the OIT in its present 
form, the answer is yes. Therefore, we should either ask less of the 
interview and supplement it with tests that are better adapted to some of 
the components, or, rejecting that, we should extend the interview and 
structure it so it will provide a better sample of linguistic, 
situational, and sociolinguistic competence. 


More specifically, in this observer's opinion, we should keep the OIT 
since it is a valid test of speaking and supports teaching and evaluation 
of speaking, but we should make it shorter, more uniform in length, and 
supplement it with tests of listening comprehension, reading, grammar, 
vocabulary, pronunciation, and writing for a more complete picture of 
competence. We should also increase the number of subcategories under 
each rating so as to reflect more adequately the vast achievement that 
mastery of a second language represents. 




References 


Beardsmore, H. Baetens. "Testing Oral Fluency." IRAL 12 (1974): 317-25. 


Clark, John L. D. "Theoretical and Technical Considerations in Oral 
Proficiency Testing." In Testing Language Proficiency, edited by 
Randall L. Jones and Bernard Spolsky, pp. 10-24. Arlington, Va.: 
Center for Applied Linguistics, 1975. 


Coward, D. A. "Confessions of an Oral Examiner." Modern Languages 68 
(1977): 35-38. 


Davison, J. M., and Geake, P. M. "An Assessment of Oral Testing Methods 
in Modern Languages." Modern Languages 51 (1970): 116-25. 


DeCesaris, Janet. "The FSI Interview." Unpublished term paper with 
cassette recording, Georgetown University, 1977. 


Dyson, A. P. "Oral Examining in French." Modern Language Journal 53 
(June 1972): 54-55. 


Gaies, S. J.; Gradman, H. L.; and Spolsky, B. "Toward the Measurement of 
Functional Proficiency: Contextualization of the Noise Test." TESOL 
Quarterly 11 (1977): 51-57. 


Johansson, S. "An Evaluation of the Noise Test--A Method for Testing 
Overall Second Language Proficiency by Perception Under Masking 
Noise." IRAL 11 (1973): 107-33. 


Jones, Randall L., and Spolsky, Bernard, eds. Testing Language 
Proficiency. Arlington, Va.: Center for Applied Linguistics, 
1975. 


Kaufman, David. "Comparison of Speaking Proficiency with Auditory 
Comprehension--An Experiment." Unpublished term paper, Georgetown 
University, 1969. 


Politzer, Robert; Hoover, Mary Rhodes; and Brown, Dwight. "Test of 
Proficiency in Black Standard and Nonstandard Speech." TESOL 
Quarterly 8 (1974): 27-35. 


Rey, Alberto. "A Study of the Attitudinal Effect of a Spanish Accent 
on Blacks and Whites in South Florida." Unpublished doctoral 
dissertation, Georgetown University School of Languages and 
Linguistics, 1974. 


Shuy, Roger W. "Sociolinguistics." In Linguistic Theory: What Can It 
Say about Reading?, edited by Roger Shuy, pp. 90-94. Newark, Del.: 
International Reading Association, 1977. 




Spolsky, Bernard; Sigurd, Bengt; Soko, Masahito; Walker, Edward; and 
Arterburn, Catherine. "Preliminary Studies in the Development of 
Techniques for Testing Overall Second Language Proficiency." 
Language Learning 18 (August 1968): 79-101. 


Wilds, Claudia P. "The Oral Interview Test." In Testing Language 
Proficiency, edited by Randall L. Jones and Bernard Spolsky, 
pp. 29-38. Arlington, Va.: Center for Applied Linguistics, 
1975. 


